Menarche and E

Age at menarche on the population-level

Approach 1: Mean + 2SD

Adapting methods described in the poster uploaded on Google Drive, I first calculated the mean pre-pubertal fE2 concentration. Because infants/juvs have elevated fE2 when nursing, we need to exclude these individuals when determining what typical pre-pubertal fE2 looks like.

Per Slack communication on 7/17: defining prebuteral E from individuals greater than 2.5 years (def not nursing) and under 3 (def not pubertal)

Per Zoom meeting on 7/22: defining prepubertal E from individuals between 3 and 4 years old.

Step 1: Preparing dataset of prepubertal individuals (3 yo)

## calculating average pre-pubertal E2 
prepubertal_data<-e_data |> 
  filter(age_at_sample >= 3.0 & age_at_sample <4) |>  # we want age 3..def pre menstruation
  select(ind_id, age_at_sample, e_conc_ug) 

Step 2: Calculating average prepubertal E

Averaging the e concentrations for all samples from individuals between 3 and 4 years of age.

prepubertal_e_data <- prepubertal_data |> 
   summarise(mean_e_conc_ug = mean(e_conc_ug, na.rm = TRUE),
            sd_e_conc_ug = sd(e_conc_ug, na.rm = TRUE)) 

mean_prepubertal_e_conc_ug <- prepubertal_e_data$mean_e_conc_ug
sd_prepubertal_e_conc_ug <- prepubertal_e_data$sd_e_conc_ug

Step 3: Calculating pubertal threshold E

Using the mean + 2SD approach:

calculated_treshold <- mean_prepubertal_e_conc_ug + 2*sd_prepubertal_e_conc_ug

print(paste("Calculated fE2 threshold for menarche: ", calculated_treshold, " ug/g"))
## [1] "Calculated fE2 threshold for menarche:  2.84527656823452  ug/g"

Step 4: Calculating population average age at menarche based on threshold E

Looking at the average age at which the population has an E reading that crosses the threshold E. Restricting dataset to individuals who we started sampling pre menstruation (before age 3.75). Only considering individuals over 4 years old to be potentially mature (i.e., only considering a threshold crossing at greater than 4 years to be indicative of menarche).

# Identify the first age where e_conc_ug rises above the threshold for each individual
first_above_threshold <- e_data %>%
  group_by(ind_id) %>%
  filter(min(age_at_sample) <= 3.75) |> # want to make sure we have samples starting before menstruation to assign cutoff age 
  ungroup() |> 
  filter(age_at_sample>4) |> # only considering potentially pubescent if older than 4
  filter(e_conc_ug > calculated_treshold) %>%
  group_by(ind_id) %>%
  summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) %>%
  ungroup()

# Calculate the mean and standard error of the ages
mean_age_above_threshold <- mean(first_above_threshold$first_age_above_threshold, na.rm = TRUE)
se_age_above_threshold <- sd(first_above_threshold$first_age_above_threshold, na.rm = TRUE) / sqrt(nrow(first_above_threshold))

print(paste("Average age at which e_conc_ug rises above the threshold: ", mean_age_above_threshold))
## [1] "Average age at which e_conc_ug rises above the threshold:  4.60833333333333"
print(paste("Standard error of the average age: ", se_age_above_threshold))
## [1] "Standard error of the average age:  0.183601319288409"

Plot of fE2 as a function of age: population-level

This plot might be a bit busy, but I think it is a helpful visualization. I am very open to feedback! Here is how to interpret it:

  • The x-axis is 1-year age-bins

  • The y-axis is fE2 concentrations (untransformed, in ug/g)

  • The transparent green

  • The dashed light blue horizontal line is the calculated prepubertal fE threshold (y = 2.845277 ug/g)

  • The dashed red vertical line is the calculated average age at puberty (x = 4.641429 years). The shaded region around this is the 95% CI estimate (±1.96*SE)

  • The transparent green points are all our sample readings for samples from individuals <8 years old at time of sample collection. Unlike the boxplots, which are grouped on the x-axis by 1-year age groups, these green points are positioned on the x-axis at their exact age at sample.

  • The blue line represents the results of a Loess regression, a non-parametric approach to modeling the relationship between x (age bins) and y (fE2). The gray shading around the blue line is the error of the estimate.

pop_e_data_menarche <- e_data %>%
  filter(age_at_sample<8) |> #only wanting to visualize under age 8
  mutate(age_group = cut(age_at_sample, breaks = seq(0, max(age_at_sample, na.rm = TRUE) + 1, by = 1), right = FALSE))

## get sample size ----
n_figure_1 <- nrow(pop_e_data_menarche)

##Figure 1
par(mfrow = c(1,1))

# Calculating mean and standard error for each age group
age_summary <- pop_e_data_menarche %>%
  group_by(age_group) %>%
  summarise(
    mean_e_at_age_years = mean(e_conc_ug, na.rm = TRUE),
    se_e_at_age_years = sd(e_conc_ug, na.rm = TRUE) / sqrt(n())
  ) %>%
  ungroup()

print("Showing all datapoints with box plots for age-year groupings and smoothing")
## [1] "Showing all datapoints with box plots for age-year groupings and smoothing"
ggplot(data = pop_e_data_menarche, aes(x = as.factor(floor(age_at_sample+1)), y = e_conc_ug)) +
  geom_point(data = pop_e_data_menarche, aes(x = age_at_sample, y = e_conc_ug),
             alpha = 0.2, color = "green") +
    geom_boxplot() +
  geom_smooth(data = age_summary, aes(x = as.numeric(age_group), y = mean_e_at_age_years), method = "loess", color = "blue") +
  geom_vline(xintercept = mean_age_above_threshold, linetype = "dashed", color = "red") +
  geom_hline(yintercept = calculated_treshold, linetype = "dashed", color = "skyblue") +
  geom_rect(aes(xmin = mean_age_above_threshold - 1.96 * se_age_above_threshold, xmax = mean_age_above_threshold + 1.96 * se_age_above_threshold, 
                ymin = -Inf, ymax = Inf), alpha = 0.01, fill = "pink") +
  ggtitle(paste0("Estradiol levels by age at sample (N = ", n_figure_1, " samples)")) +
  xlab("Age at sample (years)") +
  ylab("Fecal estradiol (ug/g)") +
  scale_x_discrete(labels = c("0-1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"))
## `geom_smooth()` using formula = 'y ~ x'

Age at menarche on the individual-level

I am pulling data on females that we have good E coverage for before and after potential menarche…I’m looking for IDs with samples starting at least by age 3.75. At least 25 samples per ID. That leaves us with the following IDs:

*NOTE: excluding TAB because we don’t have any good pre-pubertal samples from her

females_for_menarche_panel<-e_data |> 
  group_by(ind_id) |> 
  filter(min(age_at_sample)<3.75) |>
  mutate(count = n()) |> 
  ungroup() |> 
  select(ind_id,count) |> 
  arrange(desc(count)) |> 
  distinct() |> 
  filter(count>25) |> 
  filter(ind_id != "TAB") |> 
  pull(ind_id) 

menarche_panel_df<- e_data |> 
  filter(ind_id %in% females_for_menarche_panel) |>
  filter(age_at_sample < 8) |>
  select(ind_id, age_at_sample, sample_id, date, month, time, hour, e_conc_ug, is_pregnant) |> 
  arrange(ind_id, age_at_sample) |> 
  mutate(date = as.Date(date))


print(females_for_menarche_panel)
## [1] "TER" "MIL" "TAN" "MAY" "MON" "MIS" "MUS"

Plots of fE2 as a function of age: individual-level

I am going to calculate the age at menarche for each individual using the same approach as for the population level (mean + 2sd for each individual, only considering ages >4 for potential menarche). Age at menarche for each individual is represented with the dashed red vertical line.

# Calculate average and sd estradiol for each prepubertal ID
prepubertal_e_data_by_individual <- e_data |> 
  filter(age_at_sample >= 3.0 & age_at_sample < 4) |>  # age 3, pre-menarche
  select(ind_id, age_at_sample, e_conc_ug) |>
  filter(ind_id %in% females_for_menarche_panel) |> 
  group_by(ind_id) |> 
  summarise(mean_e_conc_ug = mean(e_conc_ug, na.rm = TRUE),
            sd_e_conc_ug = sd(e_conc_ug, na.rm = TRUE)) 

# Assign threshold by ID
prepubertal_e_data_by_individual <- prepubertal_e_data_by_individual |> 
  mutate(threshold_by_id = mean_e_conc_ug + 2 * sd_e_conc_ug)

# Loop over individuals to plot estradiol levels
suppressWarnings(
for (individual in females_for_menarche_panel) {
  
  individual_data <- menarche_panel_df |> 
    filter(ind_id == individual)

  threshold_by_individual <- prepubertal_e_data_by_individual |> 
    filter(ind_id == individual) |> 
    pull(threshold_by_id)
  
  print(paste("Threshold for", individual, ":", threshold_by_individual, "ug/g"))

  # Get the age at which estradiol first exceeds the threshold
  first_above_threshold_age_by_id <- individual_data |> 
    filter(age_at_sample > 4, e_conc_ug > threshold_by_individual) |> 
    summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) |> 
    pull(first_age_above_threshold)

  print(paste0("Age at menarche for ", individual, ": ", first_above_threshold_age_by_id, "years"))

  # Get duckface dates
  duckface_ages <- duckface_data %>%
    filter(ind_id == individual) %>%
    pull(age_at_sample)

  # Create the plot
  p <- ggplot(data = individual_data, aes(x = age_at_sample, y = e_conc_ug)) +
    geom_point() +
    geom_line(color = "blue") +
    geom_vline(xintercept = first_above_threshold_age_by_id,
               linetype = "dashed", color = "red") +
    ggtitle(paste0("Estradiol by age at sample for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
    xlab("Age at sample") +
    ylab("Fecal estradiol (ug/g)") +
    theme(legend.position = "none")

  print(p)
}
)
## [1] "Threshold for TER : 1.10006119055108 ug/g"
## [1] "Age at menarche for TER: 4.7years"

## [1] "Threshold for MIL : 0.196851661723125 ug/g"
## [1] "Age at menarche for MIL: 4.42years"

## [1] "Threshold for TAN : 0.47255972819919 ug/g"
## [1] "Age at menarche for TAN: 4.07years"

## [1] "Threshold for MAY : 1.98306642185859 ug/g"
## [1] "Age at menarche for MAY: 4.04years"

## [1] "Threshold for MON : 7.42743013084399 ug/g"
## [1] "Age at menarche for MON: Infyears"

## [1] "Threshold for MIS : 3.03075208724884 ug/g"
## [1] "Age at menarche for MIS: Infyears"

## [1] "Threshold for MUS : 0.699842584095534 ug/g"
## [1] "Age at menarche for MUS: 4.54years"

Adding in duckface observations to the estradiol plots

The purple vertical lines represent recordings of duckface. Visually, it doesn’t seem like duckface observations are associated with menarche.

suppressWarnings(
for (individual in females_for_menarche_panel) {
  
  individual_data <- menarche_panel_df |> 
    filter(ind_id == individual)

  threshold_by_individual <- prepubertal_e_data_by_individual |> 
    filter(ind_id == individual) |> 
    pull(threshold_by_id)
  
  print(paste("Threshold for", individual, ":", threshold_by_individual))

  # Get the age at which estradiol first exceeds the threshold
  first_above_threshold_age_by_id <- individual_data |> 
    filter(age_at_sample > 4, e_conc_ug > threshold_by_individual) |> 
    summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) |> 
    pull(first_age_above_threshold)

  print(paste0("Age at menarche for ", individual, ": ", first_above_threshold_age_by_id))

  # Get duckface dates
  duckface_ages <- duckface_data %>%
    filter(ind_id == individual) %>%
    pull(age_at_sample)

  # Create the plot
  p <- ggplot(data = individual_data, aes(x = age_at_sample, y = e_conc_ug)) +
    geom_point() +
    geom_line(color = "blue") +
    geom_vline(xintercept = first_above_threshold_age_by_id,
               linetype = "dashed", color = "red") +
    geom_vline(xintercept = as.numeric(duckface_ages), 
               linetype = "solid", color = "purple") +  # Add vertical lines for duckface dates
    ggtitle(paste0("Estradiol by age at sample for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
    xlab("Age at sample") +
    ylab("Fecal estradiol (ug/g)") +
    theme(legend.position = "none")

  print(p)
}
)
## [1] "Threshold for TER : 1.10006119055108"
## [1] "Age at menarche for TER: 4.7"

## [1] "Threshold for MIL : 0.196851661723125"
## [1] "Age at menarche for MIL: 4.42"

## [1] "Threshold for TAN : 0.47255972819919"
## [1] "Age at menarche for TAN: 4.07"

## [1] "Threshold for MAY : 1.98306642185859"
## [1] "Age at menarche for MAY: 4.04"

## [1] "Threshold for MON : 7.42743013084399"
## [1] "Age at menarche for MON: Inf"

## [1] "Threshold for MIS : 3.03075208724884"
## [1] "Age at menarche for MIS: Inf"

## [1] "Threshold for MUS : 0.699842584095534"
## [1] "Age at menarche for MUS: 4.54"

Menarche and Duckface

Seeing if first rise in E is significantly associated with duckface

Considering association if peak in E occurs within 5 days of duckface observation….it doesn’t seem like it does :(. In fact, we don’t have any observations of duckface within 5 days of any potential menarche peak. This could be due to sampling bias! But, in any case, we cannot say that duckface is a signal of menarche with our data.

find_first_rise_date <- function(df, threshold) {
  df %>%
    arrange(age_at_sample) %>%
    filter(e_conc_ug > threshold) %>%
    slice(1) %>%
    pull(date)
}

# Statistical analysis to check association between duckface and first rise in estradiol
association_results <- data.frame(ind_id = character(), first_rise_date = as.Date(character()), duckface_within_2_days = logical(), duckface_dates = character(), stringsAsFactors = FALSE)

for (individual in females_for_menarche_panel) {
  individual_data <- menarche_panel_df %>%
    filter(ind_id == individual)
  
  first_rise_date <- find_first_rise_date(individual_data, calculated_treshold)
  
  duckface_dates <- duckface_data %>%
    filter(ind_id == individual) %>%
    pull(date)
  
  date_differences <- as.numeric(difftime(duckface_dates, first_rise_date, units = "days"))
  
  duckface_within_5_days <- any(abs(date_differences) <= 5)
  
  # Get the duckface dates within 2 days
  duckface_dates_within_5_days <- duckface_dates[abs(date_differences) <= 5]  
  # Join the duckface dates within 2 days into a single string
  duckface_dates_str <- paste(duckface_dates_within_5_days, collapse = ", ")
  
  association_results <- rbind(association_results, data.frame(ind_id = individual, first_rise_date = first_rise_date, duckface_within_5_days = duckface_within_5_days, duckface_dates = duckface_dates_str, stringsAsFactors = FALSE))
}


print(association_results)
##   ind_id first_rise_date duckface_within_5_days duckface_dates
## 1    TER      2021-03-22                  FALSE               
## 2    MIL      2021-06-09                  FALSE               
## 3    TAN      2021-03-26                  FALSE               
## 4    MAY      2021-04-22                  FALSE               
## 5    MON      2021-06-03                  FALSE               
## 6    MIS      2021-06-09                  FALSE               
## 7    MUS      2023-05-11                  FALSE

Cycling E

Establish Cycle Length

Pretty simple, just counting the days between local E peaks for each of our adult females.

# Panel of all individual females who have good E coverage before and after menarche
females_for_cycling_panel<-e_data |> 
  group_by(ind_id) |> 
  filter(min(age_at_sample)>4) |> 
  mutate(count = n()) |> 
  ungroup() |> 
  select(ind_id,count) |> 
  arrange(desc(count)) |> 
  distinct() |> 
  filter(count>25) |> 
  pull(ind_id)

females_for_cycling_panel_df<- e_data |> 
  filter(ind_id %in% females_for_cycling_panel) |>
  mutate(date = as.Date(date)) |> 
  # filter(date > as.Date("2021-12-31")) |>
  select(ind_id, age_at_sample, sample_id, date, month, time, hour, e_conc_ug, is_pregnant) |> 
  arrange(ind_id, age_at_sample) 
n_figure_2 <-nrow(females_for_cycling_panel_df)

create_continuous_segments <- function(df) {
  df %>%
    arrange(date) %>%
    mutate(segment = cumsum(is_pregnant == "P") + 1) %>%
    group_by(segment) %>%
    filter(any(is_pregnant == "NP")) %>%
    ungroup() %>%
    mutate(segment = cumsum(is_pregnant == "P") + 1) %>%
    filter(is_pregnant == "NP")
}

# Function to find local peaks
find_peaks <- function(df) {
  df %>%
    arrange(date) %>%
    mutate(peak = rollapply(e_conc_ug, width = 3, FUN = function(x) x[2] == max(x), fill = NA, align = 'center')) %>%
    filter(peak == TRUE) %>%
    select(ind_id, date, e_conc_ug, is_pregnant)
}

# Function to calculate cycle lengths
calculate_cycle_lengths <- function(df) {
  df <- df |> 
    filter(is_pregnant == "NP")
  
  peaks <- find_peaks(df)
  peaks <- peaks %>%
    arrange(date) %>%
    mutate(next_date = lead(date)) %>%
    mutate(cycle_length = as.numeric(difftime(next_date, date, units = "days"))) %>%
    filter(!is.na(cycle_length))
  
  # Filter to only include cycle lengths within 45 days
  valid_cycles <- peaks %>%
    filter(cycle_length <= 45)
  
  # Calculate average cycle length
  avg_cycle_length <- valid_cycles %>%
    summarise(avg_cycle_length = mean(cycle_length, na.rm = TRUE))
  
  return(avg_cycle_length)
}


# Initialize a vector to store all cycle lengths
all_cycle_lengths <- c()

# Loop through each individual and create a ggplot for each
for (individual in females_for_cycling_panel) {
  individual_data <- females_for_cycling_panel_df %>%
    filter(ind_id == individual)
  
  # Calculate cycle lengths for this individual
  cycle_lengths <- calculate_cycle_lengths(individual_data)
  
  # Add this individual's cycle lengths to the overall list
  all_cycle_lengths <- c(all_cycle_lengths, cycle_lengths)

  # Create continuous segments based on NP periods
  segmented_data <- create_continuous_segments(individual_data)
  
 p<-ggplot(data = individual_data, aes(x = date, y = e_conc_ug)) +
    geom_point(aes(color = is_pregnant)) +
    ggtitle(paste0("Estradiol by date for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
    xlab("Date") +
    ylab("Fecal estradiol (ug/g)") +
   theme(legend.position = "right") +
    labs(color = "Pregnancy Status")+
   
    geom_line(data = segmented_data, aes(group = segment), color = "blue")   # Plot lines for continuous NP segments

  print(p)
  
  # Print cycle lengths for this individual
  print(paste0("Average cycle length for individual ", individual, ": ", cycle_lengths$avg_cycle_length, " days"))
}

## [1] "Average cycle length for individual TRU: 16.3636363636364 days"

## [1] "Average cycle length for individual TAM: 23.0909090909091 days"

## [1] "Average cycle length for individual TES: 18.4285714285714 days"

## [1] "Average cycle length for individual MUL: 18.1666666666667 days"

## [1] "Average cycle length for individual MEZ: 23.6 days"

## [1] "Average cycle length for individual TET: 22.3 days"

## [1] "Average cycle length for individual TEL: 21 days"
# Calculate overall average and standard error of cycle lengths
all_cycle_lengths<-as.numeric(all_cycle_lengths)

# Calculate overall average and standard error of cycle lengths
overall_avg_cycle_length <- mean(all_cycle_lengths, na.rm = TRUE)
overall_se_cycle_length <- sd(all_cycle_lengths, na.rm = TRUE) / sqrt(length(all_cycle_lengths))
  
print(paste0("Overall average cycle length: ", overall_avg_cycle_length, " days"))
## [1] "Overall average cycle length: 20.4213976499691 days"
print(paste0("Overall standard error of cycle lengths: ", overall_se_cycle_length, " days"))
## [1] "Overall standard error of cycle lengths: 1.05350126578618 days"

Visualizing Cycling E and Duckface

Purple lines = observed duck face

for (individual in females_for_cycling_panel) {
  individual_data <- females_for_cycling_panel_df %>%
    filter(ind_id == individual)
  
  # Calculate cycle lengths for this individual
  cycle_lengths <- calculate_cycle_lengths(individual_data)
  
  # Create continuous segments based on NP periods
  segmented_data <- create_continuous_segments(individual_data)
  
  # Get duckface dates for this individual
  duckface_dates <- duckface_data %>%
    filter(ind_id == individual) %>%
    pull(date)
  
  
  p <- ggplot(data = individual_data, aes(x = date, y = e_conc_ug)) +
    geom_point(aes(color = is_pregnant)) +
    ggtitle(paste0("Estradiol by date for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
    xlab("Date") +
    ylab("Fecal estradiol (ug/g)") +
    theme(legend.position = "none") +
    geom_line(data = segmented_data, aes(group = segment), color = "blue") +  # Plot lines for continuous NP segments
    geom_vline(xintercept = as.numeric(duckface_dates), linetype = "solid", color = "purple") +  # Add vertical lines for duckface dates
    geom_vline(xintercept = as.Date(cycle_lengths$avg_cycle_length), linetype = "dashed", color = "green")

  print(p)
  
  # Print cycle lengths for this individual
  print(paste0("Average cycle length for individual ", individual, ": ", cycle_lengths$avg_cycle_length, " days"))
}

## [1] "Average cycle length for individual TRU: 16.3636363636364 days"

## [1] "Average cycle length for individual TAM: 23.0909090909091 days"

## [1] "Average cycle length for individual TES: 18.4285714285714 days"

## [1] "Average cycle length for individual MUL: 18.1666666666667 days"

## [1] "Average cycle length for individual MEZ: 23.6 days"

## [1] "Average cycle length for individual TET: 22.3 days"

## [1] "Average cycle length for individual TEL: 21 days"

Summary table: local peaks and duckface

# Combine data from all individuals
all_peaks <- data.frame()
all_duckface_events <- data.frame()

for (individual in females_for_cycling_panel) {
  individual_data <- females_for_cycling_panel_df %>%
    filter(ind_id == individual)
  
  # Find peaks for this individual
  peaks <- find_peaks(individual_data)
  all_peaks <- rbind(all_peaks, peaks)
  
  # Get duckface dates for this individual
  duckface_dates <- duckface_data %>%
    filter(ind_id == individual) %>%
    pull(date)
  
  duckface_events <- data.frame(ind_id = individual, date = duckface_dates)
  all_duckface_events <- rbind(all_duckface_events, duckface_events)
}

# Check for duckface events within ±4 days of peak estradiol dates
all_peaks <- all_peaks %>%
  mutate(duckface_within_4_days = sapply(date, function(peak_date) {
    any(abs(difftime(all_duckface_events$date[all_duckface_events$ind_id == ind_id], peak_date, units = "days")) <= 4)
  }))

# Summarize results
summary_table <- all_peaks %>%
  group_by(ind_id) %>%
  summarise(
    total_peaks = n(),
    peaks_with_duckface = sum(duckface_within_4_days),
    proportion_with_duckface = mean(duckface_within_4_days)
  )
print(summary_table)
## # A tibble: 7 × 4
##   ind_id total_peaks peaks_with_duckface proportion_with_duckface
##   <chr>        <int>               <int>                    <dbl>
## 1 MEZ             22                   3                    0.136
## 2 MUL             21                   5                    0.238
## 3 TAM             30                   9                    0.3  
## 4 TEL             16                   6                    0.375
## 5 TES             29                   9                    0.310
## 6 TET             22                   9                    0.409
## 7 TRU             31                  10                    0.323

Analysis: Cycling E and Duckface

We want to formally test if duckface covaries with E such that we can map E onto the ovarian cycle. We would expect duckface to coincide with ovulation, so I am looking for duckface occurrences within 5 days of local E peaks.

NOTE: this is a LOT of code/data wrangling…I did my best to annotate it, but please let me know if I can clarify anything.

e_data$date<-as.Date(e_data$date)

# Calculate cumulative rainfall over the last 90 days
calculate_weather_metrics <- function(combined_df, weather_data) {
  # Join the weather data with the combined dataframe
  combined_df_with_weather <- combined_df %>%
    left_join(weather_data, by = "date") %>%
    arrange(ind_id, date)
  
  # Calculate cumulative rainfall for the past 30 days
  combined_df_with_weather <- combined_df_with_weather %>%
    mutate(
      mean_rain_last_90_days = rollapply(rainfall, width = 90, FUN = mean, align = "right", fill = NA, na.rm = TRUE),
      mean_max_temp_last_30_days = rollapply(temperature, width = 30, FUN = mean, align = "right", fill = NA, na.rm = TRUE)
    ) %>%
    ungroup()
  
  return(combined_df_with_weather)
}

# finding duckface and E data within 5 days of each other
expanded_duckface_data_5_days <- duckface_data %>%
  rowwise() %>%
  do(data.frame(ind_id = .$ind_id, 
                date = seq(.$date - days(3), .$date + days(2), by = "day"),
                age_at_sample = .$age_at_sample,
                original_duckface_date = .$date
  )) %>%
  ungroup()


duckface_e_df_5_days <- expanded_duckface_data_5_days %>%
  inner_join(e_data, by = c("ind_id", "date")) |> 
  select(ind_id, sample_id, age_at_sample.x, date, original_duckface_date, e_conc_ug) %>%
  rename("sample_date" = date,
         "age_at_sample" = age_at_sample.x) %>%
  distinct(sample_id, .keep_all = TRUE) %>%  # in case there are more than 1 duckfaces observed by sampled date, only keep 1
  mutate("duckface_sample" = 1L) 

duckface_sampleids_5_days<-duckface_e_df_5_days |> 
  pull(sample_id)

non_duckface_e_df_5_days <- e_data |> 
  filter(!sample_id %in% duckface_sampleids_5_days) |> 
  select(ind_id, sample_id, age_at_sample, date, e_conc_ug) |> 
  mutate("duckface_sample" = 0L)

combined_df_5_days <- bind_rows(duckface_e_df_5_days, non_duckface_e_df_5_days) |> 
  mutate(date = coalesce(sample_date, date)) %>%
  select(-sample_date)

# calculating rainfall and maxtemp in past 30 days of e sample
combined_df_5_days <- calculate_weather_metrics(combined_df_5_days, weather_data) |> 
  filter(!is.na(mean_rain_last_90_days)) |> 
  filter(!is.na(mean_max_temp_last_30_days))

# breaking df into young and old based on age below average E threshold and age above threshold
combined_df_5_days_young<-combined_df_5_days |> 
  filter(age_at_sample < mean_age_above_threshold)

combined_df_5_days_mature <- combined_df_5_days |> 
  filter(age_at_sample >= mean_age_above_threshold)

Assessing normality of E data

Data is still not normal following log transformation. I also tried excluding outlier points using z-scores but still, not normal. I think we have a large enough sample that we can still run parametric tests on these data, but please let me know if you disagree.

shapiro.test(log(combined_df_5_days_mature$e_conc_ug))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(combined_df_5_days_mature$e_conc_ug)
## W = 0.99429, p-value = 0.0004842

T-Test: Duckface E vs. Non-Duckface E

Starting with a simple T-test (not adjusting for anything!), let’s see if duckface E samples are significantly higher than non-duckface E samples.

t_test_result_5_days <- t.test(log(e_conc_ug) ~ duckface_sample, data = combined_df_5_days_mature)

print(t_test_result_5_days)
## 
##  Welch Two Sample t-test
## 
## data:  log(e_conc_ug) by duckface_sample
## t = -3.0925, df = 105.89, p-value = 0.002538
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.7819459 -0.1709997
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1860851       0.6625579
ggplot(data = combined_df_5_days_mature, aes(x = factor(duckface_sample), y = log(e_conc_ug))) +
  geom_boxplot() 

So yes, a simple T-test shows us that samples within 5 days of duckface observations have significantly greater E than samples not within 5 days of duckface observations. BUT, this modeling approach is quite crude. Let’s control/adjust for other factors that will influence this relationship….

Linear Mixed-Effects Model Averaging: Duckface E vs. Non-Duckface E

I am using linear mixed-effects model averaging with the 'MuMIn' and 'lme4' packages in R. The following variables are being loaded into the full model:

Fixed/Random Predictor Description
Fixed Age at sample Age of individual at which the sample was collected
Fixed Duckface sample A binary variable for whether or not the sample was collected within 5 days of a duckface observation
Fixed Mean rain last 90 days

Average rainfall across the 90 days leading up to sample collection—from MERRA 2.

NOTE: this is a scaled, continuous variable

Fixed Mean maximum temperature last 30 days

Average maximum temperature across the 90 days leading up to sample collection—from MERRA 2

NOTE: this is a scaled, continuous variable

Random Individual ID Female ID
Running the full model
full_formula_lmer_mature <- lmer(log(e_conc_ug) ~ age_at_sample + duckface_sample + 
                            scale(mean_rain_last_90_days) +scale(mean_max_temp_last_30_days) + (1 | ind_id),
                          data = combined_df_5_days_mature, 
                          na.action = "na.fail")

summary(full_formula_lmer_mature)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## log(e_conc_ug) ~ age_at_sample + duckface_sample + scale(mean_rain_last_90_days) +  
##     scale(mean_max_temp_last_30_days) + (1 | ind_id)
##    Data: combined_df_5_days_mature
## 
## REML criterion at convergence: 3631.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8783 -0.6464 -0.0328  0.6695  3.2860 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ind_id   (Intercept) 0.4281   0.6543  
##  Residual             1.7175   1.3105  
## Number of obs: 1054, groups:  ind_id, 31
## 
## Fixed effects:
##                                    Estimate Std. Error t value
## (Intercept)                        0.330900   0.241935   1.368
## age_at_sample                      0.004033   0.016399   0.246
## duckface_sample                    0.394228   0.147981   2.664
## scale(mean_rain_last_90_days)      0.051769   0.061894   0.836
## scale(mean_max_temp_last_30_days) -0.068751   0.047051  -1.461
## 
## Correlation of Fixed Effects:
##             (Intr) ag_t_s dckfc_ s(___9
## age_at_smpl -0.823                     
## duckfc_smpl -0.061  0.027              
## scl(___90_) -0.012  0.080  0.004       
## sc(____30_) -0.114  0.076 -0.076 -0.219
plot_model(full_formula_lmer_mature, type = "est", show.values = TRUE, value.offset = .3, title = "Log(Estradiol)", show.intercept = TRUE)

Model averaging

NOTE; when model averaging with a delta AICc parameter of 2, there is actually only one model that minimizes AIC, and it only includes duckface as a predictor.

So, our best model is actually just log(e_conc_ug) ~ intercept + duckface_sample + (1|ind_id)

dredged_model_lmer_mature <- dredge(full_formula_lmer_mature)
## Warning in dredge(full_formula_lmer_mature): comparing models fitted by REML
## Fixed term is "(Intercept)"
print(dredged_model_lmer_mature)
## Global model call: lmer(formula = log(e_conc_ug) ~ age_at_sample + duckface_sample + 
##     scale(mean_rain_last_90_days) + scale(mean_max_temp_last_30_days) + 
##     (1 | ind_id), data = combined_df_5_days_mature, na.action = "na.fail")
## ---
## Model selection table 
##     (Int) age_at_smp dck_smp scl(men_max_tmp_lst_30_dys)
## 3  0.3468             0.3803                            
## 1  0.3746                                               
## 7  0.3617             0.3944                    -0.05827
## 11 0.3524             0.3788                            
## 5  0.3877                                       -0.04888
## 9  0.3808                                               
## 4  0.3151  0.0028450  0.3803                            
## 15 0.3756             0.3938                    -0.06776
## 2  0.3554  0.0017680                                    
## 13 0.4014                                       -0.05833
## 8  0.3419  0.0018470  0.3943                    -0.05880
## 12 0.3049  0.0042610  0.3788                            
## 6  0.3801  0.0008133                            -0.04951
## 10 0.3445  0.0032780                                    
## 16 0.3309  0.0040330  0.3942                    -0.06875
## 14 0.3695  0.0029220                            -0.05928
##    scl(men_ran_lst_90_dys) df    logLik   AICc delta weight
## 3                           4 -1809.545 3627.1  0.00  0.672
## 1                           3 -1811.858 3629.7  2.61  0.182
## 7                           5 -1810.915 3631.9  4.76  0.062
## 11                 0.02349  5 -1811.375 3632.8  5.68  0.039
## 5                           4 -1813.463 3635.0  7.84  0.013
## 9                  0.02723  4 -1813.659 3635.4  8.23  0.011
## 4                           5 -1812.796 3635.6  8.52  0.009
## 15                 0.04725  6 -1812.515 3637.1  9.98  0.005
## 2                           4 -1815.106 3638.3 11.12  0.003
## 13                 0.04762  5 -1815.056 3640.2 13.04  0.001
## 8                           6 -1814.148 3640.4 13.25  0.001
## 12                 0.02790  6 -1814.590 3641.3 14.13  0.001
## 6                           5 -1816.694 3643.4 16.32  0.000
## 10                 0.03088  5 -1816.875 3643.8 16.68  0.000
## 16                 0.05177  7 -1815.695 3645.5 18.37  0.000
## 14                 0.05137  6 -1818.243 3648.6 21.44  0.000
## Models ranked by AICc(x) 
## Random terms (all models): 
##   1 | ind_id
# averaged_model_lmer_mature <- model.avg(dredged_model_lmer_mature, subset = delta < 2)
Final model
best_model_duckface_lmer <- lmer(log(e_conc_ug) ~  duckface_sample + 
                            (1 | ind_id),
                          data = combined_df_5_days_mature, 
                          na.action = "na.fail")

summary(best_model_duckface_lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(e_conc_ug) ~ duckface_sample + (1 | ind_id)
##    Data: combined_df_5_days_mature
## 
## REML criterion at convergence: 3619.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8625 -0.6450 -0.0410  0.6742  3.2789 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ind_id   (Intercept) 0.341    0.584   
##  Residual             1.724    1.313   
## Number of obs: 1054, groups:  ind_id, 31
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)       0.3468     0.1244   2.787
## duckface_sample   0.3803     0.1477   2.576
## 
## Correlation of Fixed Effects:
##             (Intr)
## duckfc_smpl -0.083
plot_model(best_model_duckface_lmer, type = "est", show.values = TRUE, value.offset = .3, title = "Log(Estradiol)", show.intercept = TRUE)

Checking model diagnostics

I would appreciate someone else’s opinion, but I am inclined to say that our model passes diagnostics. I am a bit concerned about the heteroskedasticity though, and we already know that our e_data isn’t normal….

# checking model diagnostics
plot(best_model_duckface_lmer) #resids vs leverage -- pretty good, maybe some heteroskedasticity

lattice::qqmath(best_model_duckface_lmer) ## qq norm plot -- pretty good...some tails

plot(best_model_duckface_lmer, # scale location plot 
     sqrt(abs(resid(.)))~fitted(.),
     type=c("p","smooth"), col.line=1)

Pregnancy E

Calculating age at first birth

first_birth_data <- life_history_data |> 
  filter(Event %in% c("DOB", "First Birth")) |> 
  select(ID, Event, Date) |> 
  pivot_wider(names_from = Event, values_from = Date) |> 
  filter(!is.na(`First Birth`)) |>  # Keep only individuals with a First Birth value
  mutate(Age_at_First_Birth = (as.numeric(`First Birth`) -  as.numeric(DOB))/365) 


# Calculate the average age at first birth and its standard error
age_stats <- first_birth_data |> 
  summarize(
    Average_Age = mean(Age_at_First_Birth, na.rm = TRUE),
    Std_Dev_Age = sd(Age_at_First_Birth, na.rm = TRUE),
    Sample_Size = n()
  ) |> 
  mutate(
    Std_Error_Age = Std_Dev_Age / sqrt(Sample_Size)
  )

# Display the results
kable(age_stats)
Average_Age Std_Dev_Age Sample_Size Std_Error_Age
6.190776 1.021683 15 0.2637973

Assigning Gestation Lengths

I am not sure how to go about doing this…E seems to bounce around a LOT throughout gestation. The green dashed line on each plot represents the female’s average non-pregnant E.

e_data <- e_data |> 
  mutate(across(c(B2,B3,B4), ~ na_if(.x, " ")))

e_data$B1 <- as.Date(e_data$B1)
e_data$B2 <- as.Date(e_data$B2)
e_data$B3 <- as.Date(e_data$B3)
e_data$B4 <- as.Date(e_data$B4)

e_data$date <- as.Date(e_data$date)
# # Calculate the difference in months
# 
gestation_df_long <- e_data %>%
   pivot_longer(cols = c(B1, B2, B3, B4), names_to = "birth_number", values_to = "birthing_date")

gestation_df <- gestation_df_long |> 
  mutate(months_from_birth = interval(date, birthing_date) / months(1)) |> 
  select(ind_id, age_at_sample, month, months_from_birth, e_conc_ug, birth_number)

gestating_df<- gestation_df |> 
  filter(months_from_birth >= 0 & months_from_birth <= 9) |> 
  mutate(months_from_birth = -1*months_from_birth)

gestating_df_females<-gestating_df |> 
  distinct(ind_id) |> 
  pull(ind_id) 

not_gestating_df<-gestation_df |> 
  filter(is.na(months_from_birth) | months_from_birth > 9)  |> 
  filter(age_at_sample > 6) |> 
  group_by(ind_id) |> 
  summarise(non_gestating_e2 = mean(e_conc_ug))

  
# Plot the data for each individual
for (individual in gestating_df_females) {
  individual_data <- gestating_df %>%
    filter(ind_id == individual)
  
  # Establish baseline E2 for the individual
  baseline_e <- not_gestating_df |> 
    filter(ind_id == individual) |>
    pull(non_gestating_e2)
  
  p <- ggplot(data = individual_data, aes(x = months_from_birth, y = e_conc_ug, color = birth_number, linetype = birth_number)) +
    geom_point() +
    geom_line() +
    geom_hline(yintercept = baseline_e, linetype = "dashed", color = "green") +
    ggtitle(paste0("Estradiol and gestation for ", individual, " (n = ", nrow(individual_data), " samples)")) +
    xlab("Months from birth") +
    ylab("Fecal estradiol (ug/g)") +
    ylim(0, 150) +
    xlim(-9, 0) +
    theme(legend.position = "right") +
    labs(color = "Birth Number", linetype = "Birth Number")

  print(p)
}

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

E across trimesters

This is just using the trimester estimates that were already in the dataset because, as I say above, I do not know a good way to use our E values to assign gestation lengths. It bounces around SO much!

Step 1. Preparing dataset

Restricting dataset to adults (age_at_sample>mean_age_above_threshold)

pregnant_df<-e_data |> 
  select(ind_id, month, hour, age_at_sample, e_conc_ug, preg_trim, pregnancy_num, is_pregnant) |> 
  filter(age_at_sample > mean_age_above_threshold) |> 
  mutate(preg_trim = factor(preg_trim, 
                             levels = c("", "T1", "T2", "T3"),
                             labels = c("Not pregnant", "T1", "T2", "T3")))

n_pregnant_df <- nrow(pregnant_df)

Step 2. Data visualization

Providing both unchanged and log-transformed visualizations.

par(mfrow = c(1,1))
# non transformed
ggplot(data = pregnant_df, aes(x = preg_trim, y = e_conc_ug))+
  geom_boxplot()+
  geom_jitter(alpha = 0.1)+
  ggtitle(paste("Estradiol and gestation (N = ", n_pregnant_df, "samples)")) +
  xlab("Pregnancy trimester")+
  ylab("Fecal estradiol (ug/g)")+
  scale_x_discrete(labels = c("Not pregnant", "T1", "T2", "T3"))  

# log transformed
ggplot(data = pregnant_df, aes(x = preg_trim, y = log(e_conc_ug)))+
  geom_boxplot()+
  geom_jitter(alpha = 0.1)+
  ggtitle(paste("Estradiol and gestation (N = ", n_pregnant_df, "samples)")) +
  xlab("Pregnancy trimester")+
  ylab("Log(Fecal estradiol [ug/g])")+
  scale_x_discrete(labels = c("Not pregnant", "T1", "T2", "T3"))  

Step 3. Assessing normality

Data is still not normal following log transformation

shapiro.test(pregnant_df$e_conc_ug)
## 
##  Shapiro-Wilk normality test
## 
## data:  pregnant_df$e_conc_ug
## W = 0.27018, p-value < 2.2e-16
shapiro.test(log(pregnant_df$e_conc_ug))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(pregnant_df$e_conc_ug)
## W = 0.99424, p-value = 0.0003329

Step 4. Kruskall-Wallis/Mann Whitney U Tests

Doing non-parametric tests (Kruskall-Wallis, pairwise Mann Whitney U Tests) to see how E varies from non-pregnant individuals to pregnant individuals across the three trimesters.

We can see pregnancy status is significantly associated with E according to the Kruskall-Wallis test (chi-squared = 196.85, df = 3, p<0.001).

Pregnant E is significantly greater than non-pregnant E for all three trimesters (all p<0.001). Among trimesters of pregnancy, we can see that T2<T3 (p<0.01), but no other significant results.

kruskal_result <- kruskal.test(e_conc_ug ~ preg_trim, data = pregnant_df)
print(kruskal_result)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  e_conc_ug by preg_trim
## Kruskal-Wallis chi-squared = 196.85, df = 3, p-value < 2.2e-16
# Pairwise comparisons (if the Kruskal-Wallis test is significant)
pairwise_results <- pairwise.wilcox.test(pregnant_df$e_conc_ug, 
                                         pregnant_df$preg_trim, 
                                         p.adjust.method = "bonferroni")
print(pairwise_results)
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  pregnant_df$e_conc_ug and pregnant_df$preg_trim 
## 
##    Not pregnant T1     T2    
## T1 1.1e-13      -      -     
## T2 3.0e-11      0.6410 -     
## T3 < 2e-16      0.8521 0.0045
## 
## P value adjustment method: bonferroni

Lactation E

Step 1. Preparing dataset for lactating vs not lactating

Restricting dataset to adults (age_at_sample > mean_age_above_threshold)

lactation_df <- e_data |> 
  select(ind_id, month, hour, age_at_sample, e_conc_ug, lactating) |> 
  filter(age_at_sample > mean_age_above_threshold)

n_lactation_df<-nrow(lactation_df)

Step 2. Data visualization

Providing both unchanged and log-transformed visualizations.

# non transformed
ggplot(data = lactation_df, aes(x = lactating, y = e_conc_ug))+
  geom_boxplot()+
  geom_jitter(alpha = 0.1)+
  ggtitle(paste("Estradiol and lactation (N = ", n_pregnant_df, "samples)")) +
  xlab("Lactating")+
  ylab("Fecal estradiol (ug/g)")

# log transformed
ggplot(data = lactation_df, aes(x = lactating, y = log(e_conc_ug)))+
  geom_boxplot()+
  geom_jitter(alpha = 0.1)+
  ggtitle(paste("Estradiol and lactation (N = ", n_pregnant_df, "samples)")) +
  xlab("Lactating")+
  ylab("log(Fecal estradiol [ug/g])")

Step 3. Assessing normality

Data is still not normal following log transformation

shapiro.test(lactation_df$e_conc_ug)
## 
##  Shapiro-Wilk normality test
## 
## data:  lactation_df$e_conc_ug
## W = 0.27018, p-value < 2.2e-16
shapiro.test(log(lactation_df$e_conc_ug))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(lactation_df$e_conc_ug)
## W = 0.99424, p-value = 0.0003329

Step 4. Mann Whitney U Test

Data is not normal, so can’t do ANOVA. Let’s use a nonparametric approach.

We can see that E is significantly lower during lactation (W = 138521, p < 0.001).

wilcox.test(log(e_conc_ug) ~ lactating, data = lactation_df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  log(e_conc_ug) by lactating
## W = 138521, p-value = 4.561e-10
## alternative hypothesis: true location shift is not equal to 0

E across life history stages

Let’s do a box plot looking at E across the lifecourse:

  1. Juvenility (age_at_sample< mean_age_above_threshold)
  2. Cycling (is_pregnant = “N”)
  3. Pregnant (is_pregnant = “Y”)
  4. Lactating (lactating = “YES”)

I prefer the log-transformed plots, but I’ll provide both transformed and not-transformed. Let me know what you prefer!

# Prepare the data
e_data_with_lifestage <- e_data |> 
  mutate(
    life_stage = case_when(
      age_at_sample < mean_age_above_threshold ~ "Juvenility",
      lactating == "YES" & age_at_sample > mean_age_above_threshold  ~ "Lactating",
      is_pregnant == "NP"  & age_at_sample > mean_age_above_threshold & lactating != "YES" ~ "Cycling",
      is_pregnant == "P" ~ "Pregnant",
      TRUE ~ NA_character_  # Handle any other cases if necessary
    )
  ) 

# Create the box plot with non-transformed e
ggplot(data = e_data_with_lifestage, aes(x = life_stage, y = e_conc_ug)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.1) +
  ggtitle("Estradiol Concentration Across the Lifecourse") +
  xlab("Life Stage") +
  ylab("Fecal Estradiol (ug/g)") +
  scale_x_discrete(limits = c("Juvenility", "Cycling", "Pregnant", "Lactating"))

# Create the box plot with log-transformed e
ggplot(data = e_data_with_lifestage, aes(x = life_stage, y = log(e_conc_ug))) +
  geom_boxplot() +
  geom_jitter(alpha = 0.1) +
  ggtitle("Estradiol Concentration Across the Lifecourse") +
  xlab("Life Stage") +
  ylab("Log(Fecal Estradiol [ug/g])") +
  scale_x_discrete(limits = c("Juvenility", "Cycling", "Pregnant", "Lactating"))